Support Vector Machines
Key Concepts
Margin & boundary — We describe Support Vector Machines for binary classification. The goal is to find a hyperplane that separates the two classes with the maximum margin, i.e. the widest possible gap between them. The points sitting exactly on the margin boundaries are called support vectors, and they are the only points that matter for defining the boundary (all other points could be removed and the boundary wouldn’t change).
Kernel trick — What if the data isn’t linearly separable? The kernel trick implicitly maps the data into a higher-dimensional space where a flat hyperplane can separate it. An RBF (Gaussian) kernel effectively projects data into infinite dimensions. You never compute the actual coordinates in that space — the “trick” is that you only ever need dot products, which the kernel function computes directly.
Soft margin (C) — Real data has noise and outliers. The regularization parameter C controls the tradeoff:
- Low C → wide margin, tolerates violations (points inside or across the margin)
- High C → narrow margin, penalizes violations heavily, fits training data more tightly
SVMs are especially powerful in high-dimensional spaces (text classification, genomics) and when you have more features than training examples. Their main limitation is that they don’t scale well to very large datasets, where neural networks typically win.
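To make the C tradeoff concrete, here is a minimal sketch using scikit-learn's `SVC` on a synthetic two-moons dataset; the dataset, `gamma`, and the two C values are arbitrary choices for illustration, not recommendations:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Synthetic, noisy, non-linearly-separable data
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Low C tolerates margin violations (more support vectors);
# high C penalizes them heavily (tighter fit to training data)
for C in (0.1, 100.0):
    clf = SVC(kernel="rbf", gamma=1.0, C=C).fit(X, y)
    print(f"C={C}: {clf.n_support_.sum()} support vectors, "
          f"train accuracy {clf.score(X, y):.2f}")
```

Typically the low-C model keeps many points as support vectors while the high-C model uses fewer and fits the training set more closely.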
Mathematical Formulation
1. Hard Margin SVM (linearly separable data)
The goal is to find a hyperplane \(\mathbf{w}^\top \mathbf{x} + b = 0\) that separates two classes with the maximum margin.
Primal optimization problem:
\[\min_{\mathbf{w},\, b} \quad \frac{1}{2} \|\mathbf{w}\|^2\]
\[\text{subject to} \quad y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1 \quad \forall\, i\]
The margin width is \(\dfrac{2}{\|\mathbf{w}\|}\), so minimizing \(\|\mathbf{w}\|^2\) maximizes the margin.
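For concreteness, the margin formula on a toy weight vector (the numbers are made up for illustration):

```python
import numpy as np

w = np.array([3.0, 4.0])          # hypothetical weight vector, ||w|| = 5
margin = 2.0 / np.linalg.norm(w)  # margin width = 2 / ||w||
print(margin)  # 0.4
```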
2. Soft Margin SVM (non-separable data, regularization parameter C)
Slack variables \(\xi_i \geq 0\) allow points to violate the margin.
Primal problem:
\[\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \quad \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i\]
\[\text{subject to} \quad y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0 \quad \forall\, i\]
- \(\xi_i = 0\): point is correctly classified and outside the margin
- \(0 < \xi_i \leq 1\): point is inside the margin but correctly classified
- \(\xi_i > 1\): point is misclassified
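The three cases follow from \(\xi_i = \max(0,\, 1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b))\) and can be checked numerically; the parameters and points below are hypothetical values chosen to hit each case:

```python
import numpy as np

w, b = np.array([1.0, 0.0]), 0.0            # hypothetical fitted parameters
X = np.array([[2.0, 0.0],                    # outside the margin
              [0.5, 0.0],                    # inside the margin, correct side
              [-0.5, 0.0]])                  # misclassified
y = np.array([1, 1, 1])

# Slack for each point: xi = max(0, 1 - y * (w.x + b))
xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
print(xi)  # [0.  0.5 1.5]
```

The outputs land in the three regimes above: \(\xi = 0\), \(0 < \xi \leq 1\), and \(\xi > 1\).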
3. Dual Formulation (both hard and soft margin)
Introducing Lagrange multipliers \(\alpha_i \geq 0\), the dual problem is:
\[\max_{\boldsymbol{\alpha}} \quad \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j\, y_i y_j\, \mathbf{x}_i^\top \mathbf{x}_j\]
\[\text{subject to} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \quad 0 \leq \alpha_i \leq C \quad \forall\, i\]
For the hard margin, there is no upper bound on \(\alpha_i\) (equivalently, \(C = \infty\)). The solution gives:
\[\mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i\]
Only points with \(\alpha_i > 0\) are support vectors.
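scikit-learn exposes exactly these dual quantities: for a fitted `SVC`, `dual_coef_` holds \(\alpha_i y_i\) for the support vectors only, so for a linear kernel \(\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i\) can be reconstructed directly (a sketch on synthetic blobs; the dataset is arbitrary):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ stores alpha_i * y_i for the support vectors,
# so w = sum_i alpha_i y_i x_i is one matrix product away
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))  # True
```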
4. Kernel SVM (non-linear boundaries)
Replace every dot product \(\mathbf{x}_i^\top \mathbf{x}_j\) with a kernel function \(k(\mathbf{x}_i, \mathbf{x}_j)\):
\[\max_{\boldsymbol{\alpha}} \quad \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j\, y_i y_j\, k(\mathbf{x}_i, \mathbf{x}_j)\]
The decision function becomes:
\[f(\mathbf{x}) = \text{sign}\!\left(\sum_{i \in \text{SV}} \alpha_i\, y_i\, k(\mathbf{x}_i, \mathbf{x}) + b\right)\]
Common kernels:
| Kernel | Formula |
|---|---|
| Linear | \(k(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^\top \mathbf{x}_j\) |
| Polynomial | \(k(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i^\top \mathbf{x}_j + c)^d\) |
| RBF (Gaussian) | \(k(\mathbf{x}_i, \mathbf{x}_j) = \exp\!\left(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2\right)\) |
| Sigmoid | \(k(\mathbf{x}_i, \mathbf{x}_j) = \tanh(\kappa\, \mathbf{x}_i^\top \mathbf{x}_j + c)\) |
The RBF kernel with \(\gamma > 0\) implicitly maps into an infinite-dimensional feature space, which is why it can separate arbitrarily complex boundaries.
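A direct implementation of the RBF kernel makes its behavior concrete: identical points give \(k = 1\), and the value decays toward 0 as the points move apart (\(\gamma = 0.5\) is an arbitrary choice):

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    # k(x, z) = exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 2.0])
print(rbf_kernel(x, x))         # 1.0 (identical points)
print(rbf_kernel(x, x + 10.0))  # ~0  (distant points)
```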
SVM Geometry
Setting up the two margin hyperplanes
The decision boundary is \(\mathbf{w}^\top \mathbf{x} + b = 0\). The SVM convention defines the two margin boundaries as the planes where the classifier output is exactly \(+1\) and \(-1\):
\[\mathbf{w}^\top \mathbf{x} + b = +1 \qquad \text{(positive margin plane)}\] \[\mathbf{w}^\top \mathbf{x} + b = -1 \qquad \text{(negative margin plane)}\]
The margin width is the perpendicular distance between these two planes.
Computing the distance between two parallel planes
Take any point \(\mathbf{x}_+\) on the positive plane and any point \(\mathbf{x}_-\) on the negative plane:
\[\mathbf{w}^\top \mathbf{x}_+ + b = +1\] \[\mathbf{w}^\top \mathbf{x}_- + b = -1\]
Subtracting:
\[\mathbf{w}^\top(\mathbf{x}_+ - \mathbf{x}_-) = 2\]
The vector \((\mathbf{x}_+ - \mathbf{x}_-)\) points from one margin plane to the other, but it may be oblique. We only want the perpendicular component — i.e., the projection onto the unit normal of the planes.
The unit normal to both planes is \(\hat{\mathbf{n}} = \dfrac{\mathbf{w}}{\|\mathbf{w}\|}\), so the margin width \(m\) is:
\[m = \hat{\mathbf{n}}^\top (\mathbf{x}_+ - \mathbf{x}_-) = \frac{\mathbf{w}^\top(\mathbf{x}_+ - \mathbf{x}_-)}{\|\mathbf{w}\|}\]
Substituting the result from above:
\[\boxed{m = \frac{2}{\|\mathbf{w}\|}}\]
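The derivation can be sanity-checked numerically with a hypothetical hyperplane and hand-picked points on the two margin planes (all values below are made up for illustration):

```python
import numpy as np

w, b = np.array([2.0, 0.0]), 1.0      # hypothetical hyperplane parameters
n_hat = w / np.linalg.norm(w)         # unit normal to the planes

# Points satisfying w.x + b = +1 and w.x + b = -1 respectively
x_pos = np.array([0.0, 5.0])          # 2*0 + 1 = +1
x_neg = np.array([-1.0, -3.0])        # 2*(-1) + 1 = -1

# Perpendicular component of the connecting vector vs. the formula
m = n_hat @ (x_pos - x_neg)
print(m, 2.0 / np.linalg.norm(w))     # both 1.0
```

Note that \((\mathbf{x}_+ - \mathbf{x}_-)\) here is oblique, yet its projection onto \(\hat{\mathbf{n}}\) matches \(2/\|\mathbf{w}\|\) exactly.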
Why \(\mathbf{w}\) is the normal direction
This follows directly from the hyperplane equation. For any two points \(\mathbf{x}_a\), \(\mathbf{x}_b\) lying on the same plane \(\mathbf{w}^\top \mathbf{x} + b = c\):
\[\mathbf{w}^\top \mathbf{x}_a + b = c \quad \text{and} \quad \mathbf{w}^\top \mathbf{x}_b + b = c\]
Subtracting gives \(\mathbf{w}^\top(\mathbf{x}_a - \mathbf{x}_b) = 0\), meaning \(\mathbf{w}\) is orthogonal to every vector lying in the plane — by definition, that makes \(\mathbf{w}\) the normal.
Why the constraints are set to \(\pm 1\) specifically
You might wonder: why \(\pm 1\) and not \(\pm 7\) or \(\pm k\)? It’s a normalization choice. If you scaled \(\mathbf{w} \to s\mathbf{w}\) and \(b \to sb\), the hyperplane \(\mathbf{w}^\top \mathbf{x} + b = 0\) would be unchanged, but the margin planes would shift. Fixing the output at \(\pm 1\) for support vectors removes this degree of freedom and pins down a unique \(\mathbf{w}\). Any other constant \(\pm k\) would give an equivalent problem — the \(k\) would cancel out in the margin formula, still yielding \(\dfrac{2}{\|\mathbf{w}\|}\).
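The scale invariance is easy to verify: multiplying \((\mathbf{w}, b)\) by any \(s > 0\) leaves every prediction sign unchanged (the parameters and data below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
w, b, s = np.array([1.0, -2.0]), 0.5, 7.0

# Scaling (w, b) -> (s*w, s*b) rescales the outputs but not their signs,
# so the decision boundary is identical
print(np.array_equal(np.sign(X @ w + b), np.sign(X @ (s * w) + s * b)))  # True
```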
Hard Margin SVM Optimization
This is a constrained quadratic program, solved via Lagrangian duality. Here’s the full derivation.
Step 1 — The primal problem
\[\min_{\mathbf{w},\, b} \quad \frac{1}{2}\|\mathbf{w}\|^2 \qquad \text{subject to} \quad y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1 \quad \forall\, i\]
The objective is convex (quadratic) and the constraints are linear, so this is a convex QP — it has a unique global minimum.
Step 2 — Form the Lagrangian
Introduce one Lagrange multiplier \(\alpha_i \geq 0\) per constraint, one for each training point. The Lagrangian is:
\[\mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i(\mathbf{w}^\top \mathbf{x}_i + b) - 1 \right]\]
The idea: if a constraint is violated, its bracket \(y_i(\mathbf{w}^\top \mathbf{x}_i + b) - 1\) is negative, so maximizing over \(\alpha_i \geq 0\) would drive the penalty term to infinity — pushing the solution back into the feasible region. At the optimum, the two terms balance.
Step 3 — KKT stationarity conditions
At the optimum, the gradient of \(\mathcal{L}\) with respect to the primal variables must vanish.
With respect to \(\mathbf{w}\):
\[\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i = \mathbf{0} \quad \Longrightarrow \quad \boxed{\mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i}\]
With respect to \(b\):
\[\frac{\partial \mathcal{L}}{\partial b} = -\sum_{i=1}^{n} \alpha_i y_i = 0 \quad \Longrightarrow \quad \boxed{\sum_{i=1}^{n} \alpha_i y_i = 0}\]
These are not the solution yet — they are necessary conditions that the solution must satisfy.
Step 4 — Derive the dual problem
Substitute both conditions back into \(\mathcal{L}\) to eliminate \(\mathbf{w}\) and \(b\), turning the primal min into a dual max purely in \(\boldsymbol{\alpha}\).
Expanding \(\frac{1}{2}\|\mathbf{w}\|^2\) using \(\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i\):
\[\frac{1}{2}\|\mathbf{w}\|^2 = \frac{1}{2}\left(\sum_i \alpha_i y_i \mathbf{x}_i\right)^\top\left(\sum_j \alpha_j y_j \mathbf{x}_j\right) = \frac{1}{2}\sum_i\sum_j \alpha_i \alpha_j\, y_i y_j\, \mathbf{x}_i^\top \mathbf{x}_j\]
Expanding the constraint term:
\[\sum_i \alpha_i y_i(\mathbf{w}^\top \mathbf{x}_i + b) = \sum_i\sum_j \alpha_i \alpha_j\, y_i y_j\, \mathbf{x}_i^\top \mathbf{x}_j + b\underbrace{\sum_i \alpha_i y_i}_{=\,0}\]
Substituting into \(\mathcal{L}\):
\[\mathcal{L} = \frac{1}{2}\sum_i\sum_j \alpha_i \alpha_j\, y_i y_j\, \mathbf{x}_i^\top \mathbf{x}_j - \sum_i\sum_j \alpha_i \alpha_j\, y_i y_j\, \mathbf{x}_i^\top \mathbf{x}_j + \sum_i \alpha_i\]
\[\boxed{\mathcal{L}_\text{dual}(\boldsymbol{\alpha}) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j\, y_i y_j\, \mathbf{x}_i^\top \mathbf{x}_j}\]
The dual problem is therefore:
\[\max_{\boldsymbol{\alpha}} \quad \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j\, y_i y_j\, \mathbf{x}_i^\top \mathbf{x}_j\]
\[\text{subject to} \quad \alpha_i \geq 0, \quad \sum_{i=1}^{n} \alpha_i y_i = 0\]
This is again a convex QP, but now in \(n\) variables (\(\boldsymbol{\alpha}\)) instead of \(d+1\) variables (\(\mathbf{w}, b\)) — a major advantage when \(d \gg n\).
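On a tiny separable dataset the dual can be solved directly with an off-the-shelf constrained optimizer; this sketch uses `scipy.optimize.minimize` with SLSQP (the data is contrived so the answer is easy to check by hand — the two closest points \((2,2)\) and \((-2,-2)\) become the support vectors):

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable dataset (hypothetical)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Q_ij = y_i y_j x_i . x_j, so the dual is max  sum(a) - 0.5 a'Qa
Q = (y[:, None] * y[None, :]) * (X @ X.T)

def neg_dual(a):
    return 0.5 * a @ Q @ a - a.sum()

res = minimize(neg_dual, x0=np.zeros(4), method="SLSQP",
               bounds=[(0, None)] * 4,                       # alpha_i >= 0
               constraints={"type": "eq",
                            "fun": lambda a: a @ y})          # sum a_i y_i = 0
alpha = res.x
w = (alpha * y) @ X          # w = sum_i alpha_i y_i x_i
print(np.round(w, 3))        # ≈ [0.25 0.25]
```

Hand-checking: by symmetry \(\mathbf{w} \propto (1,1)\), and \(\mathbf{w}^\top(2,2) = 1\) forces \(\mathbf{w} = (0.25, 0.25)\), giving margin \(2/\|\mathbf{w}\| \approx 5.66\), the distance between the two support vectors.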
Step 5 — KKT complementary slackness
The remaining KKT condition is:
\[\alpha_i \left[ y_i(\mathbf{w}^\top \mathbf{x}_i + b) - 1 \right] = 0 \quad \forall\, i\]
This means for every point, exactly one of the following holds:
| Case | Meaning |
|---|---|
| \(\alpha_i = 0\) | Point is outside the margin — ignored entirely |
| \(\alpha_i > 0\) | Point lies exactly on the margin — it is a support vector |
This is why only support vectors matter: all other \(\alpha_i = 0\), so they drop out of \(\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i\) completely.
Step 6 — Recovering \(b\)
Once \(\boldsymbol{\alpha}\) is found, \(\mathbf{w}\) follows from Step 3. To find \(b\), pick any support vector \(\mathbf{x}_s\) (where \(\alpha_s > 0\)) and use the fact that it satisfies \(y_s(\mathbf{w}^\top \mathbf{x}_s + b) = 1\):
\[b = y_s - \mathbf{w}^\top \mathbf{x}_s\]
In practice, average over all support vectors for numerical stability:
\[b = \frac{1}{|\mathcal{S}|}\sum_{s \in \mathcal{S}} \left( y_s - \mathbf{w}^\top \mathbf{x}_s \right)\]
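With solved values in hand, recovering \(b\) is one line; the support vectors and \(\mathbf{w}\) below are hypothetical solved values for a toy problem:

```python
import numpy as np

X_sv = np.array([[2.0, 2.0], [-2.0, -2.0]])  # hypothetical support vectors
y_sv = np.array([1.0, -1.0])
w = np.array([0.25, 0.25])                    # hypothetical solved weights

# Average b over all support vectors for numerical stability
b = np.mean(y_sv - X_sv @ w)
print(b)  # 0.0
```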
Why strong duality holds (and why this matters)
Because the primal is convex with linear constraints, Slater’s condition is satisfied — strong duality holds, meaning the dual maximum equals the primal minimum exactly. There is no duality gap. Solving the dual is not an approximation; it gives the exact same solution as solving the primal directly.
In practice, the dual is preferred because (a) it only involves dot products \(\mathbf{x}_i^\top \mathbf{x}_j\), which can be replaced by kernels, and (b) algorithms like SMO (Sequential Minimal Optimization) can solve the dual very efficiently by updating two \(\alpha_i\) at a time while maintaining the equality constraint \(\sum_i \alpha_i y_i = 0\).